In this tutorial, we present the process of time series analysis: we start with the visualization of time series using different examples and techniques, then discuss the processing of time series data and the notion of stationarity, and finally turn to modeling and forecasting with ARIMA models.
zoo (Z’s Ordered Observations) and xts (eXtensible Time Series) objects are the most widely used classes for representing time series in R. They represent time series data as a matrix of observations combined with an index of the corresponding dates and times.
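As a minimal sketch (with made-up values), a zoo series couples a vector of observations with an ordered date index:

```r
library(zoo)

## A week of made-up daily observations
dates <- seq(as.Date("2004-01-01"), by = "day", length.out = 7)
values <- c(3.1, 2.8, 4.0, 3.6, 3.3, 2.9, 3.7)

z <- zoo(values, order.by = dates)
z

## The index and the core data can be recovered separately
index(z)    # the dates
coredata(z) # the observations
```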
For static graphics, the lattice, latticeExtra, and ggplot2 packages will be used. We start by loading these packages and configuring some of them.
library(lattice)
library(ggplot2)
# latticeExtra must be loaded after ggplot2 to prevent masking of `layer`
library(latticeExtra)
library(RColorBrewer)
# lattice and latticeExtra configuration
myTheme <- custom.theme.2(
pch=19, cex=0.7, region=rev(brewer.pal(9, 'YlOrRd')),
symbol=brewer.pal(n=8, name="Dark2"))
myTheme$strip.background$col = myTheme$strip.shingle$col =
myTheme$strip.border$col = 'transparent'
myArgs <- list(
as.table=TRUE, between=list(x=0.5, y=0.2),
xscale.components = function(...)
modifyList(xscale.components.default(...), list(top=FALSE)),
yscale.components = function(...)
modifyList(yscale.components.default(...), list(right=FALSE)))
lattice.options(default.theme=myTheme, default.args=modifyList(
lattice.options()$default.args, myArgs))
library(zoo)
Two datasets, available in the Data folder of the repository, will be used:
aranjuez.RData: representing eight years of daily data from a meteorological station located at Aranjuez (Madrid);
navarra.RData: representing measurements of solar radiation from different meteorological stations.
## loading from Github
# library(repmis)
#
# source_data("https://github.com/zaher-stat/R_in_Grenoble-Time_Series_Talk/blob/ZAHER_Ahmed/aranjuez.RData?raw=True")
# source_data("https://github.com/zaher-stat/R_in_Grenoble-Time_Series_Talk/blob/ZAHER_Ahmed/navarra.RData?raw=True")
## loading from local repository
load('Data/aranjuez.RData')
load('Data/navarra.RData')
For the first example we will use the eight years of daily data from a meteorological station located at Aranjuez (Madrid). This multivariate time series can be displayed with the xyplot method of lattice for zoo objects with a panel for each variable.
## The layout argument arranges panels in rows
xyplot(aranjuez, layout = c(1, ncol(aranjuez)))
The package ggplot2 provides the generic method autoplot to automate the display of certain classes with a simple command.
## facet_free allows each panel to have its own range
ggplot2::autoplot(aranjuez) + facet_free()
As an example of time series of variables sharing the same scale, we will use measurements of solar radiation from different meteorological stations (the navarra data).
The first attempt to display this multivariate time series makes use of the xyplot.zoo method. The objective of this graphic is to display the behavior of the collection as a whole: the series are superposed in the same panel (superpose=TRUE) without a legend (auto.key=FALSE), using thin lines and partial transparency. Transparency softens overplotting problems and reveals density clusters, because regions with more overlapping lines are darker. The next code displays the variations around the time average (avRad).
avRad <- zoo(rowMeans(navarra, na.rm = TRUE), index(navarra))
pNavarra <- xyplot(navarra - avRad,
superpose = TRUE, auto.key = FALSE,
lwd = 0.5, alpha = 0.3, col = 'midnightblue')
pNavarra
This result can be improved with the horizon graph. The horizon graph is useful in examining how a large number of series changes over time, and does so in a way that allows both comparisons between the individual time series and independent analysis of each series. Moreover, extraordinary behaviors and predominant patterns are easily distinguished.
The next code displays the variations of solar radiation around the time average with a horizon graph, using a row for each time series. In the code we choose origin=0 and leave the argument horizonscale undefined (default). With this combination each panel has a different scale, and the colors in each panel represent deviations from the origin. This is depicted in the color key with the \(\Delta_i\) symbol, where the subscript i denotes the existence of multiple panels with different scales.
horizonplot(navarra - avRad,
layout = c(1, ncol(navarra)),
origin = 0, ## Deviations in each panel are calculated
## from this value
colorkey = TRUE,
col.regions = brewer.pal(6, "RdBu"))
The horizon graph is also useful in revealing the differences between a univariate time series and another reference. For example, we might be interested in the departure of the observed temperature from the long-term average, or in other words, the temperature change over time. Let’s illustrate this approach with the time series of daily average temperatures measured at the meteorological station of Aranjuez. The reference is the long-term daily average calculated with ave.
Ta <- aranjuez$TempAvg
timeIndex <- index(aranjuez)
longTa <- ave(Ta, format(timeIndex, '%j'))
diffTa <- (Ta - longTa)
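To see what ave does here, consider a toy example with made-up numbers (base R only): ave replaces each value by the mean of the values sharing its grouping key, which above is the day of the year (format code '%j').

```r
## Toy example: two "years" of four "days" each
x   <- c(10, 20, 30, 40,   # year 1
         14, 24, 34, 44)   # year 2
day <- rep(1:4, times = 2) # day-of-year key for each value

## ave() replaces each value by the mean of its group
ave(x, day)
## -> 12 22 32 42 12 22 32 42
```

Subtracting this long-term average from the original series, as done with diffTa above, therefore yields the departure of each observation from the mean of its calendar day.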
The ggplot method requires a data.frame with the day, year, and month arranged in different columns.
year <- function(x)as.numeric(format(x, '%Y'))
day <- function(x)as.numeric(format(x, '%d'))
month <- function(x)as.numeric(format(x, '%m'))
df <- data.frame(Vals = diffTa,
Day = day(timeIndex),
Year = year(timeIndex),
Month = month(timeIndex))
The values (Vals column of this data.frame) are displayed as a level plot thanks to the geom_raster function.
library(scales)
## The package scales is needed for the pretty_breaks function.
ggplot(data = df,
aes(fill = Vals,
x = Day,
y = Year)) +
facet_wrap(~ Month, ncol = 1, strip.position = 'left') +
scale_y_continuous(breaks = pretty_breaks()) +
scale_fill_distiller(palette = 'RdBu', direction = 1) +
geom_raster() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
This section describes the interactive alternatives of the static figures included in the previous sections with several packages: dygraphs, highcharter, and plotly. These packages are R interfaces to JavaScript libraries based on the htmlwidgets package.
Dygraphs
The dygraphs package is an interface to the dygraphs JavaScript library, and provides facilities for charting time series. It works automatically with xts time series objects, or with objects that can be coerced to this class. The result is an interactive graph, where values are displayed according to the mouse position over the time series. Regions can be selected to zoom into a time period.
library(widgetframe)
library(htmlwidgets)
library(dygraphs)
# Original graph
dyTemp <- dygraph(aranjuez[, c("TempMin", "TempAvg", "TempMax")],
main = "Temperature in Aranjuez",
ylab = "°C")
You can customize dygraphs by piping additional commands onto the original graphic. The function dyOptions provides several choices for the graphic, and the function dyHighlight configures options for data series mouse-over highlighting. For example, with the next code the semi-transparency value of the non-selected lines is reduced and the width of the selected line is increased.
dyTemp %>%
dyHighlight(highlightSeriesBackgroundAlpha = 0.2,
highlightSeriesOpts = list(strokeWidth = 2))
Highcharter
The highcharter package is an interface to the highcharts JavaScript library, with a wide spectrum of graphics solutions. Displaying time series with this package can be achieved by combining the generic highchart function with several calls to the hc_add_series function through the pipe (%>%) operator. Once again, the result is an interactive graph with selection and zoom capabilities.
library(highcharter)
library(xts)
aranjuezXTS <- as.xts(aranjuez)
highchart() %>%
hc_add_series(name = 'TempMax',
aranjuezXTS[, "TempMax"]) %>%
hc_add_series(name = 'TempMin',
aranjuezXTS[, "TempMin"]) %>%
hc_add_series(name = 'TempAvg',
aranjuezXTS[, "TempAvg"])
plotly
The plotly package is an interface to the plotly JavaScript library, also with a wide spectrum of graphics solutions. This package does not provide any function specifically focused on time series. Thus, the time series object has to be transformed into a data.frame including a column for the time index. If the data.frame is in wide format (one column per variable), each variable will be represented with a call to the add_lines function. However, if the data.frame is in long format (a column for values, and a column for variable names) only one call to add_lines is required. The next code follows this approach using the combination of fortify, to convert the zoo object into a data.frame, and melt, to transform from wide to long format.
aranjuezDF <- fortify(aranjuez[,
c("TempMax",
"TempAvg",
"TempMin")],
melt = TRUE)
summary(aranjuezDF)
## Index Series Value
## Min. :2004-01-01 TempMax:2898 Min. :-12.980
## 1st Qu.:2005-12-29 TempAvg:2898 1st Qu.: 7.107
## Median :2008-01-09 TempMin:2898 Median : 13.560
## Mean :2008-01-03 Mean : 14.617
## 3rd Qu.:2010-01-03 3rd Qu.: 21.670
## Max. :2011-12-31 Max. : 41.910
## NA's :10
The next code produces an interactive graphic with the generic function plot_ly connected to add_lines through the pipe operator %>%.
library(plotly)
plot_ly(aranjuezDF) %>%
add_lines(x = ~ Index,
y = ~ Value,
color = ~ Series)
For the sake of simplicity, we prefer to conduct the rest of the study on a simpler data set. We will therefore use data from the Time Series Data Library (TSDL), created by Rob Hyndman, professor of statistics at Monash University, Australia.
648 time series are available in this data library, at different frequency levels:
# install.packages("devtools")
# devtools::install_github("FinYang/tsdl")
library(tsdl)
tsdl
## Time Series Data Library: 648 time series
##
## Frequency
## Subject 0.1 0.25 1 4 5 6 12 13 52 365 Total
## Agriculture 0 0 37 0 0 0 3 0 0 0 40
## Chemistry 0 0 8 0 0 0 0 0 0 0 8
## Computing 0 0 6 0 0 0 0 0 0 0 6
## Crime 0 0 1 0 0 0 2 1 0 0 4
## Demography 1 0 9 2 0 0 3 0 0 2 17
## Ecology 0 0 23 0 0 0 0 0 0 0 23
## Finance 0 0 23 5 0 0 20 0 2 1 51
## Health 0 0 8 0 0 0 6 0 1 0 15
## Hydrology 0 0 42 0 0 0 78 1 0 6 127
## Industry 0 0 9 0 0 0 2 0 1 0 12
## Labour market 0 0 3 4 0 0 17 0 0 0 24
## Macroeconomic 0 0 18 33 0 0 5 0 0 0 56
## Meteorology 0 0 18 0 0 0 17 0 0 12 47
## Microeconomic 0 0 27 1 0 0 7 0 1 0 36
## Miscellaneous 0 0 4 0 1 1 3 0 1 0 10
## Physics 0 0 12 0 0 0 4 0 0 0 16
## Production 0 0 4 14 0 0 28 1 1 0 48
## Sales 0 0 10 3 0 0 24 0 9 0 46
## Sport 0 1 1 0 0 0 0 0 0 0 2
## Transport and tourism 0 0 1 1 0 0 12 0 0 0 14
## Tree-rings 0 0 34 0 0 0 1 0 0 0 35
## Utilities 0 0 2 1 0 0 8 0 0 0 11
## Total 1 1 300 64 1 1 240 3 16 21 648
We select the demography subset and then choose the series of ages at death of successive kings of England, which we will study further:
# Select demography data
demographic_data <- subset(tsdl, "Demography")
demographic_data
## Time Series Data Library: 17 Demography time series
##
## Frequency
## Subject 0.1 1 4 12 365 Total
## Demography 1 9 2 3 2 17
# Choose the ages at death of successive kings of England
kings_TS <- demographic_data[[8]]
kings_TS
## Time Series:
## Start = 1
## End = 42
## Frequency = 1
## [1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48
## [26] 59 86 55 68 51 33 49 67 77 81 67 71 81 68 70 77 56
## attr(,"source")
## [1] Hipel and McLeod (1994)
## attr(,"description")
## [1] Age of death of successive kings of England, starting with William the Conqueror
## attr(,"subject")
## [1] Demography
The data are already stored as a ts object, so we can plot them using the plot.ts() function:
plot.ts(kings_TS, ylab = "Age of death", xlab = "King number")
Before starting the modeling step using ARIMA models, we should first check for stationarity, because ARIMA models are defined for stationary time series. If you start off with a non-stationary time series, you will first need to ‘difference’ it until you obtain a stationary series. If you have to difference the time series d times to obtain a stationary series, then you have an \(ARIMA(p,d,q)\) model, where d is the order of differencing used.
You can difference a time series using the diff() function in R:
kings_TS_diff1 <- diff(kings_TS, differences = 1)
plot(kings_TS_diff1)
The optimal number of differences is given automatically by the ndiffs() function from the forecast package:
library(forecast)
ndiffs(x = kings_TS)
## [1] 1
Augmented Dickey-Fuller test
Statistically, we can apply the Augmented Dickey-Fuller test (ADF test) to check for stationarity. The null hypothesis is that the time series contains a unit root and is therefore non-stationary; rejecting the null hypothesis thus indicates that the series is stationary.
# Stationarity check
library(tseries)
## Augmented Dickey-Fuller test : p-value < 0.05 indicates that TS is stationary
adf.test(kings_TS)
##
## Augmented Dickey-Fuller Test
##
## data: kings_TS
## Dickey-Fuller = -2.1132, Lag order = 3, p-value = 0.529
## alternative hypothesis: stationary
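As a quick check (a sketch reusing the values printed above), the same test can be applied to the first-differenced series; if one difference is enough, as ndiffs() suggested, the test should now reject the null hypothesis:

```r
library(tseries)

## Ages at death of the 42 kings (same values as printed above)
kings <- c(60, 43, 67, 50, 56, 42, 50, 65, 68, 43, 65, 34, 47, 34,
           49, 41, 13, 35, 53, 56, 16, 43, 69, 59, 48, 59, 86, 55,
           68, 51, 33, 49, 67, 77, 81, 67, 71, 81, 68, 70, 77, 56)

## ADF test on the first-differenced series
adf.test(diff(kings))
```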
Here the test fails to reject the null hypothesis (p-value = 0.529 > 0.05), confirming that the original series is non-stationary and needs differencing. Once your time series is stationary, or you have transformed it to a stationary time series by differencing d times, the next step is to select the appropriate ARIMA model, which means finding the most appropriate values of p and q for an ARIMA(p,d,q) model. To do this, you usually need to examine the correlogram and partial correlogram of the stationary time series.
To plot a correlogram and partial correlogram, we can use the Acf() and Pacf() functions in R, respectively. To get the actual values of the autocorrelations and partial autocorrelations, we set plot=FALSE in the two functions.
Acf(kings_TS_diff1, lag.max=20) # plot a correlogram
Acf(kings_TS_diff1, lag.max=20, plot=FALSE) # get the autocorrelation values
##
## Autocorrelations of series 'kings_TS_diff1', by lag
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1.000 -0.360 -0.162 -0.050 0.227 -0.042 -0.181 0.095 0.064 -0.116 -0.071
## 11 12 13 14 15 16 17 18 19 20
## 0.206 -0.017 -0.212 0.130 0.114 -0.009 -0.192 0.072 0.113 -0.093
The autocorrelation at lag 1 is significant (it exceeds the significance limits) and equals -0.360. None of the autocorrelations at lags 2-20 are significant.
Pacf(kings_TS_diff1, lag.max=20) # plot a partial correlogram
Pacf(kings_TS_diff1, lag.max=20, plot=FALSE) # get the partial autocorrelation values
##
## Partial autocorrelations of series 'kings_TS_diff1', by lag
##
## 1 2 3 4 5 6 7 8 9 10 11
## -0.360 -0.335 -0.321 0.005 0.025 -0.144 -0.022 -0.007 -0.143 -0.167 0.065
## 12 13 14 15 16 17 18 19 20
## 0.034 -0.161 0.036 0.066 0.081 -0.005 -0.027 -0.006 -0.037
The partial correlogram shows significant negative partial autocorrelations at lags 1 (-0.360), 2 (-0.335) and 3 (-0.321). The partial autocorrelations tail off to zero after lag 3.
Strategy for selecting the best model:
In general, we consider that the lag after which the partial correlogram tails off to zero indicates the maximum order of the autoregressive part of the ARIMA model, and similarly the correlogram gives the maximum order of the moving average part. Therefore, the following specifications can be tested:
an \(ARMA(3,0)\) model, that is, an autoregressive model of order p=3, since the partial autocorrelogram is zero after lag 3, and the autocorrelogram tails off to zero (although perhaps too abruptly for this model to be appropriate)
an \(ARMA(0,1)\) model, that is, a moving average model of order q=1, since the autocorrelogram is zero after lag 1 and the partial autocorrelogram tails off to zero
an \(ARMA(p,q)\) model, that is, a mixed model with p and q greater than 0, since the autocorrelogram and partial correlogram tail off to zero (although the correlogram probably tails off to zero too abruptly for this model to be appropriate)
We use the principle of parsimony to decide which model is best: that is, we assume that the model with the fewest parameters is best. The \(ARMA(3,0)\) model has 3 parameters, the \(ARMA(0,1)\) model has 1 parameter, and the \(ARMA(p,q)\) model has at least 2 parameters. Therefore, the \(ARMA(0,1)\) model is taken as the best model.
==> An \(ARMA(0,1)\) model is a moving average model of order 1, or MA(1) model. This model can be written as: \[X_t - \mu = Z_t - \theta Z_{t-1}\]
Where \(X_t\) is the stationary time series we are studying (the first differenced series of ages at death of English kings), \(\mu\) is the mean of time series \(X_t\), \(Z_t\) is white noise with mean zero and constant variance, and \(\theta\) is a parameter that can be estimated.
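To get a feel for what an MA(1) process looks like, we can simulate one with arima.sim (the value θ = 0.7 is made up, chosen only for illustration, not the fitted coefficient):

```r
set.seed(123)

## Simulate 200 observations of an MA(1) process
## X_t = Z_t - 0.7 * Z_{t-1} in the minus convention used above,
## i.e. ma = -0.7 in R's plus convention X_t = Z_t + ma * Z_{t-1}
sim <- arima.sim(model = list(ma = -0.7), n = 200)

plot.ts(sim, main = "Simulated MA(1)")
acf(sim)  # for an MA(1) process, the ACF should cut off after lag 1
```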
The results of the estimated model are as follows:
library(forecast)
fit <- Arima(kings_TS, order = c(0,1,1))
summary(fit)
## Series: kings_TS
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.7218
## s.e. 0.1208
##
## sigma^2 estimated as 236.2: log likelihood=-170.06
## AIC=344.13 AICc=344.44 BIC=347.56
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.9712931 14.99836 11.92162 -10.40664 29.5176 0.7496724 0.05350284
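As a cross-check on the parsimony argument above, the candidate models can also be compared numerically by AIC (lower is better); this sketch assumes the kings_TS series loaded earlier:

```r
library(forecast)

fit_ar3 <- Arima(kings_TS, order = c(3, 1, 0))  # ARMA(3,0) on the differences
fit_ma1 <- Arima(kings_TS, order = c(0, 1, 1))  # ARMA(0,1) on the differences

AIC(fit_ar3)
AIC(fit_ma1)  # expected to be the smaller of the two if MA(1) is preferred
```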
Automatic selection:
All of these steps can be performed automatically with the auto.arima() function from the forecast package. Its results confirm that the ARIMA(0,1,1) model is the best suited to the series:
library(forecast)
## Automatic processing
fit <- auto.arima(kings_TS)
summary(fit)
## Series: kings_TS
## ARIMA(0,1,1)
##
## Coefficients:
## ma1
## -0.7218
## s.e. 0.1208
##
## sigma^2 estimated as 236.2: log likelihood=-170.06
## AIC=344.13 AICc=344.44 BIC=347.56
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.9712931 14.99836 11.92162 -10.40664 29.5176 0.7496724 0.05350284
The forecast() function is used to make predictions for the next five periods (short-term forecasts):
forecastedValues <- forecast(fit, h=5)
print(forecastedValues)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 43 67.75063 48.05479 87.44646 37.62845 97.87281
## 44 67.75063 47.30662 88.19463 36.48422 99.01703
## 45 67.75063 46.58489 88.91637 35.38042 100.12084
## 46 67.75063 45.88696 89.61429 34.31304 101.18822
## 47 67.75063 45.21064 90.29062 33.27869 102.22257
plot(forecastedValues)
Model validation:
To check the validity of the model hypotheses, we investigate whether the forecast errors of the ARIMA model are normally distributed with mean zero and constant variance, and whether there are correlations between successive forecast errors.
For example, we can make a correlogram of the forecast errors for our \(ARIMA(0,1,1)\) model for the ages at death of kings, and perform the Ljung-Box test for lags 1-20, by typing:
Acf(forecastedValues$residuals, lag.max=20)
Box.test(forecastedValues$residuals, lag=20, type="Ljung-Box")
##
## Box-Ljung test
##
## data: forecastedValues$residuals
## X-squared = 13.584, df = 20, p-value = 0.8509
None of the sample autocorrelations at lags 1-20 exceed the significance bounds, and the p-value for the Ljung-Box test is 0.85, so we can conclude that there is very little evidence of non-zero autocorrelations in the forecast errors at lags 1-20.
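The forecast package also bundles these residual diagnostics (time plot of the residuals, their ACF, a histogram, and a Ljung-Box test) into a single convenience call:

```r
library(forecast)

## One-stop residual diagnostics for the fitted ARIMA model
checkresiduals(fit)
```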
To investigate whether the forecast errors are normally distributed with mean zero and constant variance, we can make a time plot and histogram (with overlaid normal curve) of the forecast errors:
plot.ts(forecastedValues$residuals) # make time plot of forecast errors
# Histogram of the forecast errors with overlaid normal curve
plotForecastErrors <- function(forecasterrors)
{
# make a histogram of the forecast errors:
mybinsize <- IQR(forecasterrors)/4
mysd <- sd(forecasterrors)
mymin <- min(forecasterrors) - mysd*5
mymax <- max(forecasterrors) + mysd*3
# generate normally distributed data with mean 0 and standard deviation mysd
mynorm <- rnorm(10000, mean=0, sd=mysd)
mymin2 <- min(mynorm)
mymax2 <- max(mynorm)
if (mymin2 < mymin) { mymin <- mymin2 }
if (mymax2 > mymax) { mymax <- mymax2 }
# make a red histogram of the forecast errors, with the normally distributed data overlaid
# (extend the break sequence by one bin so the breaks span the full data range):
mybins <- seq(mymin, mymax + mybinsize, mybinsize)
hist(forecasterrors, col="red", freq=FALSE, breaks=mybins)
# freq=FALSE ensures the area under the histogram = 1
# compute the histogram of the simulated normal data with the same breaks, without plotting:
myhist <- hist(mynorm, plot=FALSE, breaks=mybins)
# plot the normal curve as a blue line on top of the histogram of forecast errors:
points(myhist$mids, myhist$density, type="l", col="blue", lwd=2)
}
plotForecastErrors(forecastedValues$residuals)
The time plot of the in-sample forecast errors shows that their variance is roughly constant over time (though perhaps slightly higher in the second half of the series). The histogram of the forecast errors shows that they are roughly normally distributed with a mean close to zero. Therefore, it is plausible that the forecast errors are normally distributed with mean zero and constant variance.
Since successive forecast errors do not seem to be correlated, and the forecast errors seem to be normally distributed with mean zero and constant variance, the \(ARIMA(0,1,1)\) model does seem to provide an adequate predictive model for the ages at death of English kings.